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P-h ' Abstract 

The full-potential linearized augmented-plane wave (FP-LAPW) method is well known to enable 
most accurate calculations of the electronic structure and magnetic properties of crystals and surfaces. 
The implementation of atomic forces has greatly increased its applicability, but it is still generally 
£ — ' believed that FP-LAPW calculations require substantial higher computational effort compared to the 

£SJ ■ In the present paper we analyse the FP-LAPW method from a computational point of view. 

Starting from an existing implementation (WIEN95 code), we identified the time consuming parts 
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pseudopotential plane wave (PPW) based methods. 
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^^ . and show how some of them can be formulated more efficiently. In this context also the hardware 

architecture plays a crucial role. The remaining computational effort is mainly determined by the 
setup and diagonalization of the Hamiltonian matrix. For the latter, two different iterative schemes 
are compared. The speed-up gained by these optimizations is compared to the runtime of the "orig- 
inal" version of the code, and the PPW approach. We expect that the strategies described here, can 
also be used to speed up other computer codes, where similar tasks must be performed. 
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PROGRAM SUMMARY 

Title of program extension: wien-speedup 

catalogue number: ... 

Program obtainable from: CPC Program Library, Queen's 
University of Belfast, N. Ireland (see application form in this 
issue) 

CPC Program Library programs used: cat. no.: ABRE; title: [4] 
WIEN; ref in CPC: 59 (1990) 399 

Licensing provisions: none 

Computer, operating system, and installation: 

• IBM RS/6000; AIX; Fritz-Haber-Institut der Max- 
Planck-Gcsellschaft: Berlin. 



Operating system: UNIX 

Programming language: FORTRAN77 
(non-standard feature is the use of ENDDO) 

floating point arithmetic: 64 bits 

Memory required to execute with typical data: 
64 Mbyte (depends on case) 

No. of bits in a word: 64 

No. of processors used: one 

Has the code been vectorized? no 

Memory required for test run: 64 MByte 

Keywords 

density-functional theory, linearized augmented plane wave 
method, LAPW, supercell, total energy, forces, structure op- 
timization, molecular dynamics, crystals, surfaces, molecules 

Nature of the physical problem 

For ab-initio studies of the electronic and magnetic proper- 
ties of poly-atomic systems, such as molecules, crystals, and 
surfaces. 

Method of solution 

The full-potential linearized augmented plane wave (FP-LAPW) 
method is well known to enable accurate calculations of the 
electronic structure and magnetic properties of crystals 111 2, 
I' i i I' @> i i 0' Within the supercell approach it 
has also been used for studies of defects in the bulk and for 
crystal surfaces. 
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LONG WRITE-UP 
1 Introduction 

The augmented plane wave (APW) method pL 0, 0, rc| and in particular its linearized form, the LAPW 
method ^ ||, |[ [h], O, [l^, [D| [l4|, |l5| , enables accurate calculations of electronic and magnetic properties 
of poly-atomic systems using density-functional theory (DFT) |16|, [P?]] . One successful implementation 
of the full-potential LAPW (FP-LAPW) method is the program package WIEN, a code developed by 
Blaha, Schwarz and coworkers [[u] . It has been successfully applied to a wide range of problems such as 
electric field gradients [|18|, [19) and systems such as high-temperature superconductors [£0| , minerals |2l|] , 
surfaces of transition metals p2j , or anti-ferromagnetic oxides p3| and even molecules p4[ . Minimizing 
the total energy of a system by relaxing the atomic coordinates for complex systems became possible 
by the implementation of atomic forces p3, and even molecular dynamics became feasible. Up to now 
the main drawback of the FP-LAPW-method compared to the pseudopotential plane-wave (PPW) (e.g. 
Ref. |25| and references therein) approach has been its higher computational expense. This may be 
mainly due to a discrepancy in optimization efforts spent on both methods, and therefore we have anal- 
ysed the FP-LAPW method from a computational/numerical point of view. Starting from the WIEN95 
implementation p6[, we identified the time consuming parts and will show how some of them can be 
formulated more efficiently. In this context also the influence of the underlying hardware architecture 
will be discussed. 

The remainder of the paper is organized as follows. After introducing the principles of DFT and summariz- 
ing the concepts of the FP-LAPW-method (Section g and Section ||), we will report on our improvements 
made on the WIEN95 implementation of the FP-LAPW-method (Section [|). In Section [3] we will show, 
how these improvements make the FP-LAPW-method a strong competitor to the popular PPW approach 
by comparing the run-times necessary to converge a 9 layer slab of (4x2)-Cu(110) (i.e. 72 atoms and 
792 valence electrons) using both methods. 



2 Density-Functional Theory 

The central statement of DFT is, that the problem of finding the ground-state energy of a many-particle 
system, characterized by a many-particle wavefunction \&0i can be mapped on a physically equivalent 
problem of finding the ground-state electron density no, i.e. 

E[$ ] = E[n ] with (1) 

N 

n (r) = (*o|£<5(r-r a )l*o) , (2) 

a 

where r a is the coordinate of the a-th electron. The central statement of the Hohenberg-Kohn theo- 
rem [jl 6|| is, that for an N electron system the functional E[n] is minimized by the ground-state electron 
density, no. 

E[n ] = Minion] with the constraint (3) 



nd 3 r = N . (4) 

In the Kohn-Sham formulation the functional E[n] is split into the following terms: 

E[n] = T s [n] + U[n] + E xc [n] , (5) 

the kinetic energy functional of non-interacting particles, T s [rc], the functional of the electrostatic energy, 
U[n] and the rest, called exchange-correlation energy, E xc . With Eq. (||), i.e. with the introduction of 
the functional T s [n], the variational problem of Eqs. (0),(H) becomes becomes equivalent to the problem 
of solving a system of single-particle equations, called the Kohn-Sham equations |17fl , 
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<Pi = ti<Pi (6) 



= 'jLffifiVi ■ (7) 



Here, — 2^-V 2 is the single-particle kinetic energy operator and V g is the potential defined by the 
functional derivative of U[n] + E xc [n], 

_ m + E^ 

on 

The electron density is obtained from Eq. (pi), where fi are the occupation numbers given by the Fermi 
distribution. In practice Eqs. (g), (Q) and (||) are solved in a selfconsistent field (SCF)-cycle: i.e. starting 
with density n\ one calculates the potential V g, solves Eq. (0) and by evaluating Eq. (0) one obtains the 
new density 712, which leads to the next iteration cycle. 

3 The FP-LAPW-Method 

In the augmented plane-wave (APW) method space is divided into an interstitial region (IR) and non- 
overlapping muffin-tin (MT) spheres centered at the atomic sites W. This allows an accurate description 
of both, the rapidly changing (oscillating) wave functions, potential and electron density close to the 
nuclei as well as the smoother part of these quantities in between the atoms. In the IR the basis set 
consists of plane waves exp(iK • r). The choice of a computationally efficient and accurate representation 
of the wavefunctions within the MT spheres has been discussed by several authors, e.g. ||, 0, ||, [U| - 
In the original APW formulation introduced by Slater fy, H > the plane- waves are augmented to the 
exact solutions of the Schrodinger equation within the MT at the calculated eigenvalues. This approach 
is computationally expensive because it leads to an explicit energy dependence of the basis functions 
(and consequently of the Hamilton- and overlap-matrices) and thus to a non-linear eigenvalue problem. 
Instead of performing a single diagonalization to solve the KS equation one repeatedly needs to evaluate 
(for many trial energies) the determinant of the secular equation in order to find its zeros and thus the 
single particle eigenvalues e^. Going into the complex energy plane would have been one option but was 
not explored so far, except in an other context (see e.g. || and references therein). 

In the linearized APW method the problem of the energy dependence of the basis set is removed by using 
a fixed set of suitable MT radial functions R, 0, [l(J. Within Andersen's approach, used also in the WIEN 



code, inside each atomic sphere I and for azimuthal quantum number I the radial solutions uj(ej,ri) of 
the KS equation at fixed energies ej and their energy derivatives uj(ej,ri) are used as basis functions. 
Basically, this choice corresponds to a linearization of the energy dependence of uj(e, r) around ej |LQ ], 
The concept implies that the radial functions uj (ei) and uj (e;) and the respective overlap and Hamilton 
matrix elements need to be calculated only for a few energies ej . Moreover, all KS energies ej are found, 
for each k-point, by only one diagonalization (for a detailed discussion see [fL5|). 
The LAPW basis functions 4>g(t, k) which are used for the expansion of the KS wavefunctions 

V*(r,k)= V Cl (k + G)0 G (r,k) (9) 



k+G|<G wf 



arc defined as 
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Here, G denote the reciprocal lattice vectors and k a vector within the first Brillouin zone. The wave 
function cutoff G wf limits the number of the G vectors and thus the size of the basis set. The symbols in 
Eq. (|l(j) have the following meaning: fi is the unit cell volume, s/ is the MT radius, and 17 = r — R/ is a 
vector within the MT sphere of the I-th atom. Note that F; m (f) represents a complex spherical harmonic 
with Yj_ m (f) = (— V) m Yi m (v). The radial functions ui(ei,r) and ui(ei,r) are solutions of the equations 

H s ^ Ul {e h r) = e,tti(c,,r) (11) 

H^u^eur) = [e,«,(ej,r) + ui(e h r)] (12) 

which are regular at the origin. The operator H sph contains only the spherical average, i.e., the I = 
component, of the effective potential within the MT. The e; should be chosen near the center of the energy 
band with the corresponding /-character. The coefficients a; TO (k + G) and bi m (k + G) are determined by 
requiring that value and slope of the basis functions are continuous at the surface of the MT sphere 
The representation of the potential and electron density resembles the one employed for the wave func- 
tions, i.e., 

n cS (r) =\ ' lm ff , N (13) 

I J2 n G exp(iG -r), reIR . v ' 

L |G|<GP ot 

Thus, no shape approximation is introduced and therefore such an approach is called a full-potential 
treatment. The quality of this description is controlled by the cutoff parameter G pot for the lattice 
vectors G and the number of the (Z,m)-terms included inside the MTs. 

4 Improving the WIEN code 

4.1 Optimization strategies 

To achieve an optimal performance of a computer code on modern computers, it is essential that the 
used algorithms match the underlying hardware architecture. On todays computers, often the memory 



bandwidth is the limiting factor, i.e. the floating-point operation units are stalled, waiting for data. 
Then the performance is not determined by the number of floating point operations per second, but by 
the necessary number of load/store operations. Therefore a significant objective of optimizing a code 
is to reduce the communication between the processor and the relatively slow memory, but to make 
optimum use of the fast cache. Thus the well known fact for parallel computers, that an efficient use 
of communication is crucial for complex and time consuming calculations, holds also on stand-alone 
workstations. The best way to improve the performance of a program on a wide range of architectures 
without loosing portability, is to write the code in such a way that the bulk of the calculations is performed 
by calls to the well known basic linear algebra subprograms (BLAS) |27], [2J| ; efficency can then be obtained 
by using optimized implementations of these routines, specifically tailored to the hardware used. While on 
vector machines, the so called Level 2 BLAS routines (matrix-vector-operations) lead to very satisfactory 
results, this approach is often not well suited for architectures of modern high-performance workstations 
or shared memory systems with a hierarchy of memory (registers, cache, local memory, swap space). For 
those architectures it is preferable to partition the matrices into blocks and to perform the computation 
by matrix-matrix operations on these blocks. This leads to a full reuse of data already held in cache 
(or local memory) and reduces data movement. While for Level 1 (vector- vector-operations) and Level 
2 BLAS routines the number of load/store operations is proportional to the number of floating-point 
operations, the Level 3 (matrix-matrix-operations) approach [£9| gives a surface-to- volume effect, i.e. if 
the matrices are of order n, the number of floating-point operations is of order n 3 , while the number of 
load/store operations is of order n 2 . This minimizes the influence of a limited memory bandwidth on the 
performance of the program. Therefore the goal in optimizing the code must be to use Level 3 BLAS 
routines as much as possible. 

4.2 The structure of the WIEN-code 

The SCF cycle of the WIEN code consist of five independent programs: 

1 . LAPWO : generates the potential from a given charge density 

2. LAPW1 : computes the eigenvalues and eigenvectors 

3. LAPW2: computes the valence charge density from the eigenvectors 

4. CORE: computes the core states and densities 

5. MIXER: mixes the densities generated by LAPW2 and CORE with the density of the previous iteration 
to generate a new charge density 

From these programs LAPW1 and LAPW2 are the most time consuming, while the time needed to run 
CORE and MIXER are basically negligible. Further inspection showed, that for example on IBM RS/6000 
nodes the performance of LAPW2 was far below the theoretical peak performance, which indicates a poor 
adaptation of the code to this hardware architecture. The optimizations done on LAPW2 are described 



in Section 4.2. The situation was different in the case of LAPW1. Due to the use of standard library 



routines the diagonalization of the matrix, which is the most time consuming part, performes quite 



well on IBM RS/6000 nodes. However, on several other hardware platforms with substantially slower 
memory bandwidth, the performance was not so good and those routines were also modified to increase 
performance. Thus further improvement on IBM RS/6000 could only be reached by implementing a 
new algorithm. Based on the fact that the matrix to be diagonalized changes only little from iteration 
to iteration during the selfconsistency cycle, an iterative diagonalization scheme could be an attractive 
alternative. We implemented two such schemes, which use the information from the previous step to 



speed up the diagonalization. The details will be described in Section 4.4 



4.3 LAPW2: Generating the electron density 

In LAPW2 the eigenvalues and eigenvectors found by LAPW1 are read in. The k-space integration over the 
Brillouin zone (BZ) is replaced by a finite k-summation, in which each k-point contributes with a weight, 
Wj(k), in which for convenience also the occupation factor of state ej (i.e. the Fermifactor) is stored. 
First the Fermi energy and then the expansion of the valence electron density is calculated for each of 
the occupied states at all k-points in the irreducible part of the BZ. The valence electron density consists 
of two types of components: the electron density inside each sphere /, rc/(r), represented in spherical 
harmonics on a radial grid and the interstitial electron density, n IR (r) expressed as Fourier series. 

4.3.1 The electron density inside the MT-spheres 

The valence electron density inside a sphere is given by the expression: 

n'ir) = Y. n ^n»Ari)Y L M{vi) r, < sj (14) 

l"m" 

= E^wEEE 

kj Im I'm' G,G' 

{c*(j, k + G) of* (k + G) «,(r) c(j, k + G') a(, m ,(k + G') u[{r) 
+c*(j, k + G) &&(k + G)u,(r) c(j, k + G') a\, m , (k + G') «{(r) 

+c*(j,k + G)a I ^(k + G)u t (r)c(j,k + G')b I l , m ,(k + G')u' l (r) 

+c*(j,k + G)bt(k+G)u l (r)cU,^ + G')b I l , m ,(k + G')u' l (r) }Y* n {v)Y llm ,(v) . (15) 



With the definition 



A\ mj {k) := ^cC?,k+G)of m (k+G) (16) 

G 

B( mj (k) := ^c(j,k + G)6L(k+G) (17) 



G 



the electron density reads: 



/(r) = ^^wEEK-w^^^'W*^) 

kj' Im I'm' 





Figure 1: The calculation of the interstitial electron-density in k-space can be regarded as matrix-matrix- 
multiplication h = C T WC* , where W consists of j identical vectors Wu(j)- 



+Bimj ( k )4m'j i k )ui Mi*/' (r) + Mnj ( k ) B i'm'j ( k ) u i (r)u,, (r) 
+B^ (k)S/, m/j (k)u, (r)t* (r) } F^ (f )T^ m , (f) . 



(18) 



It is obvious that the calculation of the sums (|lj) , ( fl7|) which run over all G- vectors for every combination 
of (7, j, im), will be the most time consuming part, and thus needs special care to implement it efficiently. 
The straight forward implementation of the summation, as done in the original WIEN code, results in a 
high ratio of load/store operations per floating-point operation and a very poor performance. A closer 
look shows that these formulas can be rewritten in the form of a matrix-matrix-multiplication: 



A i;k (j,lm) = ^c(j,k + G)a 7 (k + G,;TO) 

G 

B z ^(j,lm) = ^c(j,k + G)& J (k + G,;™) 



(19) 
(20) 



In this way the matrices ^4 /,k (j, Im) and B 1 ' (J, Ira) can be calculated using optimized (BLAS-3) library- 
routines, hereby reducing the number of load/store operations as well as minimizing the number of cache 
misses. 

4.3.2 The Interstitial Electron Density 

The valence electron density in the interstitial region is given by: 



n 1H {v) 



E ™k ( r ) exp(iK • r), r e IR 

|K|<KP ot 

= E E WkCflCktf, G)4(j, G') exp(*(G - G') • r) 



kj GG' 



(21) 
(22) 



where the sum over the occupied states j can again be regarded as matrix-matrix-multiplication (see 
Fig. [y), in which the matrix W consists of j identical columns Wu.(j): 



n k (G,G') := V W k (j)(£(G, j) <&j, G') (23) 

= X>£(G,j)c k (j,G') (24) 

3 

n ( r ) = EE^G.G'Jexp^G-GO-r) (25) 

k GG' 

Since Wj(k) is real, the matrix n(G, G') is hermitian, i.e. n(G, G') = n.*(G', G). Therefore the calcula- 
tion of the matrix 



n k (G,G')=^^(G,JK(i,G') (26) 

j 

by a single matrix-matrix-multiplication would result in twice as much floating-point operations as nec- 
essary, which would destroy the advantage of using optimized library routines. 

To profit from both, the hermiticity of the matrix and the use of optimized BLAS-3 library routines, the 
matrix is divided into small blocks (Fig. ||). Each block above the diagonal is evaluated by a single (BLAS- 
3) matrix-matrix-multiplication according to Eq. |2J and the result is also used for the corresponding 
block below the diagonal. The elements of the blocks along the diagonal are evaluated by a direct 
implementation of the summation: 

n k (G,G') = n k (G',G)=^ c £(G,j) Ck (j,G'), G<G' (27) 



The blocksize is a free parameter which has to be optimized according to the cache size of the specific 
platform. 

4.4 LAPW1: Setup and diagonalization of the eigenvalue problem 

4.4.1 Setup of H and S 

According to Eq. |] the KS eigenstates are characterized by a set of expansion coefficients Ci(k + G) {i = 
1, . . . ,N S }, where N s are the number of eigenstates to be calculated. In the following, these expansion 
coefficients are viewed as eigenvectors (of length N pw ) of the generalized eigenvalue problem 

(H - e,S) c t = (28) 

where H is the Hamiltonian and S the overlap matrix. The elements of H and S are given by 

Ha - {&\H\4> 3 ) (29) 

S lJ = (&|&) (30) 
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Figure 2: The hermitian matrix nk(G, G') is divided into small blocks. Each of the blocks is calculated 
by a matrix-matrix-multiplication. 



where tfij are the LAPW bsis functions. As discussed earlier, one of the main ideas of the FP-LAPW 
method is to construct sophisticated basis functions ip which provide a good approximation to the true 
wave function ifj, so that the number of basis functions N pw required to expand ip with reasonable 
accuracy, is kept small. The main drawback of this approach is that the evaluation of Eq. ( p9fJ30| ) is 
quite demanding. A simple way to reduce the computational effort in setting-up H is to consider in 
the first half of the self consistency cycle only the spherical average of the potential (i.e. the LM=(0,0) 
component). Furthermore a considerable speedup on IBM RS/6000 nodes was obtained by using an 
IBM specific mathematical library |3l| which allows a much faster evaluation of trigonometric functions 
that are required in Eq. (|29-30). These subroutines computes the trigonometric functions for a vector 
of arguments, hereby minimizing the computational costs compared to the serial evaluation of all vector 
elements. 
A combination of these procedures can significantly speed up the generation of H and S matrices. 



4.4.2 Solving the eigenvalue problem 

As noted before, the standard diagonalization routines could not be improved significantly on IBM 
RS/6000 nodes, since the modifyed LAPACK routines together with IBMs highly optimized scientific 
ESSL library yields already almost optimal performance. On other hardware platforms (e.g. SGI Power 
Challenge, DEC- Alpha, Intel PII) with slower memory bandwidth we could achieve a speedup of the 
diagonalization by more than a factor of two by modifying the standard LAPACK routines using a 
hierarchical blocking scheme as described in pij . 



4.4.3 Iterative diagonalization 

In contrast to the LAPW method, the plane wave basis set used in the PPW method allows an easy 
evaluation of Eq. (p9[]30|), but the number of expansion functions is much larger. For this reason the 
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approach to an iterative matrix diagonalization described below is somewhat different from the one 

usually adopted in the PPW method. 

We implemented two schemes of iterative matrix diagonalization, namely the Block-Davidson and the 

Lanczos algorithm. As both methods are fairly well known, here only general aspects will be discussed, 

as far as they concern the FP-LAPW method. For a detailed discussion see e.g. (]30f| - 

Since the KS equation must be solved self-consistently, the matrix C of the eigenvectors Cj in Eq. ( P8[) are 

always available (with the exception of the first cycle) from a previous cycle, C° ld . Therefore C old can be 

used to obtain an approximate solution to Eq. (Eq). If H ncw (S ncw ) is the Hamiltonian (overlap) matrix 



of the present iteration, then Eq. (28) can be transformed into the space spanned by the old eigenvectors. 
This would be no approximation to Eq. (ES) if one would include all eigenvectors, because the old and 
new eigenvectors span the same space. In practice, however, the number of calculated states, N s , is much 
smaller (by almost an order of magnitude) than the matrix size, N pw . If one would choose N s equal to 
the number of occupied states in the solid, N occ the new eigenvectors would not be improved at all, since 
the new eigenvectors, C now , would simply be a linear combination of the old eigenvectors. Here, we take 
N s = 2N occ which was found to be a good compromise between accuracy and numerical effort. 
The old eigenvectors are now viewed as an unitary transformation 

C oldt C old = S . (31) 

In the case S — E (no overlap, E unity matrix) Eq. ([3l|) always holds. In the general case Eq. ( pl| ) 
is only valid, if S ncw ~ S old . This aspect must be especially considered in the case of the LAPW 
method because the basis functions are recalculated in each iteration. This problem can be overcome 
by transforming the generalized eigenvalue problem to a regular one (i.e. by Cholesky decomposition). 
Here, we chose to treat the generalized problem with the Block-Davidson scheme and the regular problem 
using the Lanczos algorithm. The reason for this strategy is the following: The Lanczos algorithm has 
due to its simplicity a very low numerical cost and thus can compensate for the extra cost of the Cholesky 
decomposition. Treating the overlap matrix S explicitly would require to orthogonalize the sets H l B % ~ 1 
using S as a metric tensor which would ruin the numerical effort saved by not doing the decomposition. 
The reduced eigenvalue problem is then given by 

H = eSC ncw with (32) 

H = c old ^HC old , (33) 

S = C oldt SC old and (34) 



The process of iterating the solution of Eq. (p2j) consists of optimizing the N s basis functions initially 
given by C new by adding N s — N occ linear independent vectors to this set. In the subsequent discussion 
the set C ncw consisting of iV s basisvectors will be named B° and the set of JV S basisvectors added in 
iteration i, B l . The actual iteration procedure then consists of using {B ,. . . ,_B 1 } in Eq. (|32|-J34|), to 
construct B l+1 from {B° : . . . ,B 1 } and turn back to Eq. (|32|-p4|). At the end of the iteration process the 
eigenvectors are obtained from Eq. (pq). Here, the set {B ,. . . ,B 1 } is viewed at as a rectangular matrix 
of size iVp W x (i + l)N s . 

11 



AAA Lanczos Scheme 

As already mentioned above, we now take S = E. The basic idea is to improve B l by the N s vectors 
obtained from calculating HB 1-1 and orthogonalizing this set to the set B % ~ 1 (e.g. by Gram-Schmidt 
orthogonalization) . In fact this is one of the easiest ways to increase the basis set, because in practise 
HB % ~ X had to be calculated already in Eq. (B3). To our knowledge, a strict mathematical proof that the 
series HB,H 2 B, . . . ,H n B should converge to the eigenvectors of H does not exist, but experience has 
shown that this approach is fairly stable and accurate. 

4.4.5 Block-Davidson Scheme 

This scheme uses a more subtle way to expand the basis B. In iteration i one gets from Eq. ( |32| - |35| ) a 
current approximation to the true eigenvector \cj), denoted as |c*). The aim is to find a correction vector 
\SA) such that 

\c 3 ) = \c)) + \6A) . (36) 

This correction vector \5A) can be formally calculated by plugging Eq. ( pq ) into Eq. (|28|). 

(H-e 3 S)\c)) = (H-ejSWAj) (37) 

The left side of Eq. (p7J) is called residual vector, \Rj). In principle the inversion of (H ~ eS) would 
yield the correct \5A), but in practice this is never done because the computational cost of this inversion 
would already be comparable to an exact diagonalization. Thus, one only retains the diagonal elements 
of (H — eS) to make the inversion trivial. Eq. J37|) is then expressed with help of the basis B l 



*w - ?(^fW |st) «-) 



H-e 3 
The matrix containing the |<L4i), . . . , \8An s ) is then used to increase the basis B % to B l+1 . 

5 Examples 

In the following we demonstrate the effect of our improvements on a huge example, namely a 9 layer 
slab of (4x2)-Cu(110) (i.e. 72 atoms, 792 valence electrons). We will compare the CPU-time needed 
to reach selfconsistency using our improved code with the original code. Additionally we will compare 
our LAPW code with a most efficient implementation of the PPW-method [p5|. The Cu(llO) surface is 
modelled by a nine layer slab repeated periodically in all three dimensions and separated by a vacuum 
zone equivalent to five substrate layers. We use a lattice constant of 6.64 bohr, which corresponds to 
the theoretical LDA bulk value. Since both methods scale almost linearly with the number of k-points, 
only one point in the surface BZ has been used for these benchmarks. The MT radii are chosen to be 
2.20 bohr. The kinetic-energy cutoff for the plane wave basis needed for the interstitial region is set to 
13.22 Ry which leads to matrix-sizes of the hamiltonian matrix of about 7000x7000. The partial wave 
(/, m) representation (inside the MTs) is taken up to l max — 10. A plane-wave cutoff energy of 81 Ry for 
the Fourier representation of the potential is used. The maximum angular momentum in the (L,M) 
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expansion of the potential inside the atomic spheres is set to L max = 4. In the PPW calculations, plane 
waves up to a kinetic energy of 70 Ry had to be used, to reach a comparable level of accuracy, but we 
also include the CPU-time required for a PPW calculation at 40 Ry. 

5.1 LAPW2: Generating the electron density 

The original code WIEN95 needed 19680 CPU seconds (5h 28m) for the generation of the electron density 
on an IBM RS/6000 node (Table |l|). The calculation of the electron density inside the spheres took 8640 
CPU seconds (2h 24 m) (44%), while 11040 CPU seconds (3h 4m) (56%) were needed for the interstitial 
electron density. On this latter part our improvements led to a reduction of the necessary CPU-time 
to 322 CPU seconds (5m 22s), which is equivalent to a speed-up factor of 34. In the part generating 
the electron density inside the MT-spheres, the improvement is not as big, but still a speed-up factor 
of 12 could be reached, reducing the CPU-time to 740 seconds (12 m). The relative weight of the two 
tasks is shifted by the optimization to 70% for the spheres and 30% for the interstitial. With these 
improvement, the contribution of lapw2 to the overall runtime becomes negligible, and thus all further 
considerations should focus on the program lapwl and it's most time consuming part, the diagonalization 
of the Hamilton-Matrix. 





WIEN95 


optimized 






CPU-time 


% 


CPU-time 


% 


speed-up 
factor 


Spheres 
Interstitial 


2h 24m 
3h4m 


44 
56 


12m 20s 

5m 22s 


69 

31 


12 
34 


Total 


5h 28m 




17m 42s 




18.5 



Table 1: Distribution of CPU-time needed for the different parts of the generation of the new electron 
density (lapw2) comparing the original version ( "WIEN95" ) with the new one ( "optimized" ) . The column 
("speed-up factor") lists the speed-up reached. 



5.2 LAPW1 

As a general result it was found that in order to obtain reasonable accuracy in total energies it is sufficient 
for both methods, the Block-Davidson as well as the Lanczos scheme, to improve the expansion set only 
once, i.e. using {B ,!} 1 } to construct the new eigenvectors. Both iterative schemes worked well for the 
Cu(110) benchmark system. The speed-up gained with respect to the full diagonalization was 1.45 in 
the case of the Lanczos scheme and 3.12 in the case of the Block-Davidson scheme (Table El). Fig. |5.2| 
illustrates the accuracy of both methods during the SCF-cycle. In the upper panel the overall performance 
is illustrated: The left panel shows the deviation of the total energies obtained by both methods with 
respect to the exact diagonalization result. Here, the largest deviation is about 1 mRy, but when self- 
consistency is approached, the deviations are well below the convergence criterion of 0.5 mRy. The right 
panel shows in an analogous way the deviations of the electron differences (the mean square deviation of 
"-old — n now inside the MT's) during the SCF-cycle. This gives an idea about the overall quality of the 
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approximated eigenvectors. The deviation in the total energies are less than 0.3 mRy and thus show no 
essential difference between the two schemes, but the electron difference indicates that the Block-Davidson 
method leads to better eigenvectors especially during the first four cycles of the SCF-cycle. The quality 
of the eigenvectors is illustrated in more detail in the lower panel of Fig. |5.2| for the valence electron 
densities only. In the left (right) panel the mean square deviation between n OX act — "iter is evaluated 
inside the MT's (interstital region), where "iter" stands for either the Davidson or the Lanczos method. 
It can be clearly seen that the Davidson method leads to results that are closer to the exact solution than 
the results obtained by the Lanczos-method, but again, when self-consistency is reached, both methods 
give essentially the same results for the interstitial region as well as inside the MT. 





WIEN95 


optimized 






CPU-time 


CPU-time 


speed— up 

factor 


spherical (H , S) 
non-spherical (H) 
Diagonalization 
Diagonalization 


43m 17s 

37m 13s 

lh 12m 5s 


16m 29s 

18m 36s 

Lan: 49m 42s 

Dav: 23m 07s 


2.62 
2.00 
1.45 
3.12 


Total 


2h 32m 35s 


Lan: lh 24m 47s 
Dav: 57m 12s 


1.80 
2.67 



Table 2: CPU-time in LAPW1 needed for the setup (spherical and non-spherical H and S matrix) and 
the diagonalization; for the original code the standard diagonalization is used ("WIEN95"), while for the 
"optimized" version the timing for both, the Lanczos ( "Lan" ) and the Block-Davidson ( "Dav" ) method 
are given and the non-spherical part of the Hamiltonian (which is ignored for the first half of the iterations 
towards self-consistency) is the average over all iterations. The last column lists the corresponding speed- 
up factors. 



5.3 Total Speed-ups 

Table || shows the distributions of CPU-time needed for the different parts of the LAPW-selfconsistency 
cycle for the original WIEN95 program as well as for our new, optimized code. The enormous speed- 
up factor close to 20 for the program lapw2 indicates, that the original code, which was tuned for a 
vector machine, did not match the needs of modern high performance workstations with fast but small 
cache and relativly slow main-memory access. This extraordinary speed-up could not be gained for the 
program lapwl. However, with the implementation of the new iterative diagonalization algorithms and 
the omission of the non-spherical terms to the Hamilton-matrix in the first half of the selfconsistency run, 
the required CPU-time is cut down by a factor of 2.67. In total all our modifications lead to a speed-up 
of 4.80. 
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Figure 3: Upper panel: Deviations of total energy (left panel) in mRy and electron distance (right panel) 
with respect to the exact diagonalization for the Lanczos (filled boxes) and Block-Davidson (open circles) 
method during the SCF-cycle. Lower panel: Valence electron distance to the exact solution (see text) 
for the MT-contributions (left panel) and plane wave contributions in interstitial region (right panel). 
At the start of the SCF-cycle an exact diagonalization is performed (no deviation) to obtain the input 
wavefunctions for the Lanczos- and Davidson-method, respectively. 

5.4 Comparing the computationally costs FP-LAPW and Pseudopotentials 
Plane Waves codes 

In order to compare our improved FP-LAPW-code with the PPW-approach, we also calculated our test 
system with the highly optimized PPW code fhi96md p5J . 

In this implementation of the PPW method, the Kohn-Sham-equations are solved by an iterative opti- 
mization of a set of trial wave functions, combining self-consistency and iterative matrix diagonalization, 
where the iteration of the wave functions is formulated in terms of equations of motion, as proposed by 
Car and Parinello Q. In the fhi96md code a second order equation is used, which had been suggested 
by Joannopoulos |p3| . In this scheme, each single step is computationally much cheaper than a single 
selfconsistency cycle within the FP-LAPW-method, but the number of iterations needed to reach self- 
consistency is usually much larger and crucially depends on the quality of the initial guess for the wave 



15 





WIEN95 


optimized 






CPU-time 


% 


CPU-time 


% 


speed— up 

factor 


lapwO 

lapwl 

lapw2 

core 

mixer 


26m 
2h 33m 

5h 28m 

4s 
2m 


5 
30 
65 


26m 

57m 

17m 

4s 

2m 


22 
61 

15 

2 


1.00 

2.67 

19.29 


Total 


8h 29m 




lh 46m 




4.80 



Table 3: Distribution of CPU-time needed for the different parts of the LAPW-selfconsistency cycle 
comparing the original version ("WIEN95") with the new one ("optimized"). The last column shows the 
speed-up factor reached. 

functions. For this reason the f hi96md code employs a mixed-basis-set initialization, which gives starting 
wave functions of high quality. For details see Ref . [£5| . 

Table || shows the CPU-time needed to converge our test system using both iterative matrix diagonaliza- 
tion methods. The initialization in the FP-LAPW-method is just the time needed to construct a starting 
electron density, whereas for the PPW-method the time reflects the set up of the starting wave functions 
within the mixed-basis scheme. As already mentioned, the time needed for a single iteration is much 
smaller for the f hi96md implementation of the PPW-approach than in the FP LAPW code, but this 
advantage is destroyed by the fact, that about five times as many iterations are needed to reach selfcon- 
sistency. It should be noted that the meaning of "iteration" is in fact different in the FP-LAPW and the 
PPW method, as the PPW method [£5| combines the iterative diagonalization with the selfconsistent 
update of the electron density. While the original WIEN code needed about 30% more cpu-time than the 
PPW-code to converge this system, our improved version is about three times faster. 
It is important to note that Table [| summarizes benchmark-calculations performed in summer 1997, and 





FP-LAPW 


PPW 




original 


optimized 


70 Ry 


40 Ry 


J- initialization 
-^ iteration 
Tritcrations 


30m 

8h 24m 
20 


30m 

lh46m 

20 


18h 40m 

lh 7m 

100 


8h 45m 

30m 

100 


^total 


168h 34m 


35h 50m 


130h 20m 


58h 45m 



Table 4: CPU-time needed on an IBM RS/6000 node to converge a nine layers slab, representig a 2x2 
Cu(110) surface cell. Comparison of the original WIEN95 code ("original"), our improved code ("opti- 
mized" ) and the fhi96md pseudopotential plane wave program ( "PPW" ) with two different plane wave 
cutoffs ("70 Ry" and "40 Ry"). 
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that the system Cu (110) with a (2) surface structure and 72 atoms per supercell was most favorable 
to identify the advantages of the new FP-LAPW code, and at the same time it was least favorable 
for the plane-wave pseudopotential code fhi96md. In the meantime several improvements are being 
introduced in the pseudopotential code, as for example a real-space projector method |35|| to evaluate 
the pseudopotential matrix-elements (which brings a speed up between a factor of 2 and 3), and ultra- 
soft pseudopotentials J36| (which brings a speed up by another factor of 2). Altogether, for the chosen 
benchmark system the new version of the plane-wave pseudopotential code, fhi99md, is about a factor of 



20 faster, without loss in accuracy |37 . But we also note that for other systems the difference in CPU 
time required for the new, fhi99md, and the older, fhi96md, code is much less pronounced. 
Other plane-wave pseudopotential codes [ pq , p9[ also employ the mentioned improvements and behave 
similar to the fhi99md code. This discussion shows that comparisons between different methods (e.g., 
FP-LAPW versus plane- wave pseudopotentials) is indeed helpful to identify and optimize time critical al- 
gorithms and routines. With ever increasing system size program developments are getting more and more 
important. Although FP-LAPW was ahead the pseudopotential code (with respect to lower CPU time 
consumption) for some systems in 1997 and 1998, recent improvements by introducing new concepts at 
the plane-wave pseudopotentials front make this again a more efficient code. We are convinced, however, 
that new concepts and techniques will also bring a speed up to FP-LAPW. Clearly FP-LAPW remains the 
most accurate tool and does not suffer from problems as linearization of core- valence exchange-correlation 
(which can be partially corrected in pseudopotential calculations), or the lack of core polarization (which 
may be important, e.g. for some magnetic systems). However, besides accuracy low CPU time require- 
ments are clearly very important. A fast (i.e. efficiently) working electronic structure code is crucial for 
present days problems, in particular to be able to test all relevant numerical approximations with the 
required care. We note that in many density-functional theory calculations performed for low symmetry 
and/or many-atom systems the main approximations are (often) not at the level of exchange-correlation 
functional but at the level of numerical approximations. 

Although our test system may be a special case and other systems or a different computer architecture 
may lead to slight modifications, the fair estimate of the relative speed between FP-LAPW and PPW 
should remain valid. 

6 Summary 

The present work demonstrates that a continuous adaption of algorithms to the existing hardware archi- 
tecture is indeed very important for efficient and accurate electronic structure calculations of many-atom 
systems. While the WIEN 95 implementation of the FP-LAPW-method was optimized for a vector com- 
puter and performs well on those platforms, it is not well suited for modern cache-based processors. Our 
improvements led to a significant speed-up on those hardware achitectures and makes the FP-LAPW 
method a strong competitior to the popular PPW approach. Especially for transition metal systems, 
the FP-LAPW method has a significant advantage. In addition the FP-LAPW method gives as an all- 
electron method additional information about the system, which is out of reach for any pseudopotential 
method because of the frozen core approximation. 
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The significant improvements discussed here have been implemented in the new version WIEN97 of the 
FP-LAPW code |4(J and the sucessful strategy adopted here may be useful for other software developers 
too. 
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